This dataset was derived from the Class, Architecture,
Topology and Homology (CATH) database for the structural classification of proteins.
Sequences from the Protein Data Bank (PDB) were incorporated to reconstruct
entire chains. The curated dataset includes only single-chain monomeric proteins with
at most 40% sequence identity. Proteins that were not solved by X-ray diffraction, and proteins
shorter than 30 amino acids, were filtered out. The final
curated dataset contains 4752 unique sequences in four classes: mainly-α, mainly-β, mixed α+β, and few secondary structure (fss).
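
The filtering criteria described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the `ChainRecord` fields, the `keep` and `curate` helpers, the ASCII class labels, and the method string "X-RAY DIFFRACTION" are all assumptions made for illustration. The 40% sequence identity cutoff is not implemented here, since it would normally be applied with a separate clustering tool; only exact duplicate sequences are removed.

```python
from dataclasses import dataclass

MIN_LENGTH = 30  # chains shorter than 30 residues are dropped
# Hypothetical ASCII labels for the four classes named in the description.
CLASSES = ("mainly-alpha", "mainly-beta", "alpha+beta", "fss")


@dataclass(frozen=True)
class ChainRecord:
    """Hypothetical record layout; the real files may be organised differently."""
    pdb_id: str
    sequence: str
    method: str      # experimental method, e.g. "X-RAY DIFFRACTION" (assumed spelling)
    n_chains: int    # number of chains in the entry
    cath_class: str  # one of CLASSES


def keep(record: ChainRecord) -> bool:
    """Return True if the record passes the stated filters."""
    return (
        record.n_chains == 1                              # single-chain monomers only
        and record.method.upper() == "X-RAY DIFFRACTION"  # X-ray structures only
        and len(record.sequence) >= MIN_LENGTH            # at least 30 amino acids
        and record.cath_class in CLASSES                  # one of the four classes
    )


def curate(records):
    """Apply the filters and drop exact duplicate sequences, keeping input order."""
    seen = set()
    kept = []
    for record in records:
        if keep(record) and record.sequence not in seen:
            seen.add(record.sequence)
            kept.append(record)
    return kept
```

Calling `curate` on a list of `ChainRecord` objects returns the subset that satisfies the length, method, chain-count, and class criteria, with one entry per unique sequence.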
Relevant paper: Amino Acids 49 (2017) 261-271.